Chapter 18 - Exercises¶

In [1]:
library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(forcats)
library(pROC)
options(mc.cores = parallel::detectCores())
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2, forcats 1.0.0, ggplot2 3.4.2, lubridate 1.9.2, purrr 1.0.1, readr 2.1.4, stringr 1.5.0, tibble 3.2.1, tidyr 1.3.0
✖ dplyr::filter() and dplyr::lag() mask the stats versions
This is bayesplot version 1.10.0 (theme set to bayesplot::theme_default())
rstan (Version 2.21.8); rstan::extract() masks tidyr::extract()
This is rstanarm version 2.21.4 (default priors may change, so it's safest to specify priors, even if equivalent to the defaults); rstanarm::loo() masks rstan::loo()
pROC: cov, smooth and var mask the stats versions

Exercise 18.1¶

a)¶

In [2]:
coffee_data <- coffee_ratings %>% 
    select(species, flavor) %>%
    mutate( species=(species=="Robusta") )
sample_n( coffee_data, 10 )
A tibble: 10 × 2
species  flavor
<lgl>    <dbl>
FALSE    7.58
FALSE    7.08
FALSE    7.75
FALSE    7.58
FALSE    6.83
FALSE    7.83
FALSE    7.67
FALSE    7.42
FALSE    7.50
FALSE    8.08
In [3]:
options(repr.plot.width=15, repr.plot.height=5)
ggplot( coffee_data, aes(x=flavor, fill=species) ) + geom_density( alpha=0.5 )

Fit a logistic regression model with

$$Y_i | \beta_0, \beta_1 \sim \text{Bern}(\pi_i),\quad \text{with } \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
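As a sketch (not required by the exercise), this specification could be fit with rstanarm; the weakly informative priors below are placeholders for the unspecified $m_0$, $s_0$, $m_1$, $s_1$:

```r
# Hedged sketch: prior locations/scales are placeholders, not values
# prescribed by the exercise.
coffee_model <- stan_glm(
  species ~ flavor,
  data = coffee_data, family = binomial,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),  # centered intercept beta_0c
  prior = normal(0, 2.5, autoscale = TRUE),            # beta_1
  chains = 4, iter = 5000*2, seed = 84735
)
```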

b)¶

In [4]:
ggplot( trees, aes(x=Height, y=Girth) ) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'

Fit a linear regression model with

$$Y_i | \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2),\quad \text{with } \mu_i = \beta_0 + \beta_1 X_i$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
$$\sigma \sim \text{Exp}(l)$$
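Analogously, a hedged rstanarm sketch for this normal model (prior values are again placeholders):

```r
trees_model <- stan_glm(
  Girth ~ Height,
  data = trees, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),  # beta_0c
  prior = normal(0, 2.5, autoscale = TRUE),            # beta_1
  prior_aux = exponential(1, autoscale = TRUE),        # sigma ~ Exp(l)
  chains = 4, iter = 5000*2, seed = 84735
)
```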

c)¶

In [5]:
head( radon )
A data.frame: 6 × 4
  floor  county  log_radon   log_uranium
  <int>  <fct>   <dbl>       <dbl>
1 1      AITKIN  0.83290912  -0.6890476
2 0      AITKIN  0.83290912  -0.6890476
3 0      AITKIN  1.09861229  -0.6890476
4 0      AITKIN  0.09531018  -0.6890476
5 0      ANOKA   1.16315081  -0.8473129
6 0      ANOKA   0.95551145  -0.8473129

Uranium levels are given at the county level, radon values at the house level. Each county thus has its own distribution of log_radon that may depend on log_uranium. If the county is known, the uranium level is redundant, since it is constant within the county; if the county is unknown, the uranium level can help us predict the radon level. This is a somewhat special situation, but it can nevertheless be encoded as a hierarchical model with random intercepts, where the intercepts absorb the contribution of the uranium level to the mean log_radon concentration.

Distribution of log radon values per county:

In [6]:
options(repr.plot.width=15, repr.plot.height=10)
ggplot( radon, aes(x=log_radon, y=fct_reorder(county, log_radon, .fun='mean') ) ) + geom_boxplot()

Relationship between log uranium concentration and mean log radon concentration per county:

In [7]:
options(repr.plot.width=15, repr.plot.height=5)
radon %>% 
    group_by( county ) %>% 
    summarize( log_uranium=mean(log_uranium), mean_log_radon=mean(log_radon) ) %>% 
    ggplot( aes(x=log_uranium, y=mean_log_radon) ) +
    geom_point() +
    geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

There is a more or less linear relationship between the county-wide mean radon concentration and the county-wide uranium concentration.

Variant of a normal hierarchical model with random intercepts:

$$Y_{ij} | \beta_{0j}, \beta_1, \sigma_y \sim N(\mu_j, \sigma_y^2), \quad \text{with } \mu_j = \beta_{0j} + \beta_1 X_j$$
$$\beta_{0j} | \beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
$$\sigma_y \sim \text{Exp}(l_y)$$
$$\sigma_0 \sim \text{Exp}(l_0)$$

Here $Y_{ij}$ is the $i$-th log radon measurement in county $j$ and $X_j$ is the log uranium value of county $j$. Notice the special form of the model, with $X_j$ and $\mu_j$ depending only on the county. In the next chapter of the book you will learn that $X_j$ is a group-level predictor.
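A hedged sketch of how this random-intercepts model could be fit with stan_glmer (all prior values are placeholders for the unspecified hyperparameters):

```r
radon_model <- stan_glmer(
  log_radon ~ log_uranium + (1 | county),
  data = radon, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),  # beta_0c
  prior = normal(0, 2.5, autoscale = TRUE),            # beta_1
  prior_aux = exponential(1, autoscale = TRUE),        # sigma_y
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)
```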

d)¶

In [8]:
head( roaches )
A data.frame: 6 × 5
  y     roach1  treatment  senior  exposure2
  <int> <dbl>   <int>      <int>   <dbl>
1 153   308.00  1          0       0.800000
2 127   331.25  1          0       0.600000
3 7     1.67    1          0       1.000000
4 7     3.00    1          0       1.000000
5 0     2.00    1          0       1.142857
6 0     0.00    1          0       1.000000
In [9]:
dim( roaches )
  1. 262
  2. 5
In [10]:
ggplot( roaches, aes(x=y, fill=as.factor(treatment) ) ) + geom_density( alpha = 0.5 )

Fit a linear regression model with

$$Y_i | \beta_0, \beta_1, \sigma \sim N(\mu_i, \sigma^2),\quad \text{with } \mu_i = \beta_0 + \beta_1 X_i$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
$$\sigma \sim \text{Exp}(l)$$

where $X_i$ is 1 if the apartment has received pest control treatment and 0 otherwise.
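A hedged rstanarm sketch (priors are placeholders); treatment is already a 0/1 indicator, so it can serve directly as $X_i$:

```r
roaches_model <- stan_glm(
  y ~ treatment,
  data = roaches, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),  # beta_0c
  prior = normal(0, 2.5, autoscale = TRUE),            # beta_1
  prior_aux = exponential(1, autoscale = TRUE),        # sigma ~ Exp(l)
  chains = 4, iter = 5000*2, seed = 84735
)
```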

Exercise 18.2¶

In [11]:
head( book_banning )
A data.frame: 6 × 17
Columns: title, book_id, author, date, year, removed, explicit, antifamily, occult, language, lgbtq, violent, state, political_value_index, median_income, hs_grad_rate, college_grad_rate

title                                                             book_id  author            date        year  removed  explicit  antifamily  occult  language  lgbtq  violent  state
House of the Spirits, The                                         927      Allende, Isabel   2005-04-01  2005  0        1         0           1       1         1      1        AK
It's Not the Stork!: A Book About Girls, Boys, Babies and Bodies  1024     Harris, Robie     2008-02-06  2008  1        0         0           0       0         0      0        AK
King Stork                                                        1087     Pyle, Howard      2008-10-02  2008  0        0         0           0       0         0      1        AK
How They Met and Other Stories                                    936      Levithan, David   2008-10-05  2008  0        0         0           0       0         0      0        AK
Ghost in the Shell                                                764      Masamune, Shirow  2008-10-02  2008  0        0         0           0       0         0      0        AK
King Stork                                                        1087     Pyle, Howard      2003-09-13  2003  0        0         0           0       0         0      1        AK

(the four state-level columns political_value_index, median_income, hs_grad_rate and college_grad_rate are identical across these AK rows and omitted here)
In [12]:
dim( book_banning )
  1. 931
  2. 17

I prefer numeric values for the removal and reason indicators:

In [13]:
book_banning <- book_banning %>% 
    mutate_at( vars(removed:violent), \(x) {as.numeric(x)-1} )

a)¶

In the US, political views and the corresponding laws differ widely between states. In conservative states, anti-family books might be banned more easily; in Democratic-leaning states, discriminatory books might be. Examine this with a plot:

In [14]:
book_banning %>% 
    group_by( state ) %>% 
    summarize( removal_rate = mean( removed ) ) %>% 
    ggplot( aes(x=fct_reorder(state, removal_rate), y=removal_rate) ) + 
    geom_point() +
    xlab("State") +
    ylab("Mean removal rate")

The average removal rates differ considerably! What about the proportions of the reasons?

In [15]:
book_banning  %>% 
    select( removed:state ) %>% 
    filter( removed==1 ) %>%
    select( -removed ) %>% 
    group_by( state ) %>% 
    summarize_all( sum ) %>% 
    pivot_longer( explicit:violent, names_to="reason", values_to="count" ) %>% 
    ggplot( aes(x=state, y=count, fill=reason) ) +
    geom_bar( stat="identity" )

b)¶

$$Y_{ij} | \beta_{0j}, \beta_1, \beta_2, \beta_3 \sim \text{Bern}(\pi_{ij}), \quad \text{with } \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \beta_{0j} + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \beta_3 X_{ij3}$$
$$\beta_{0j} | \beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
$$\beta_2 \sim N(m_2, s_2^2)$$
$$\beta_3 \sim N(m_3, s_3^2)$$
$$\sigma_0 \sim \text{Exp}(l_0)$$

Here $Y_{ij}$ indicates whether challenged book $i$ in state $j$ was removed, and $X_{ij1}$, $X_{ij2}$ and $X_{ij3}$ are dummy variables encoding whether the book was challenged for inappropriate language, anti-family content or violence, respectively.

c)¶

In [16]:
book_banning %>% 
    count( state ) %>% 
    arrange( desc(n) )
A data.frame: 47 × 2
state n
<chr> <int>
PA 148
OR 118
CO 81
CA 50
IL 39
MI 36
VA 34
OH 29
FL 26
NY 25
OK 23
IN 20
NC 20
MN 18
IA 16
SC 15
ID 14
GA 13
KS 13
KY 13
TN 13
VT 13
AZ 12
LA 12
MO 11
NJ 11
AL 10
WI 10
AK 9
WA 9
MA 8
MT 8
NH 7
CT 6
ND 6
AR 5
MD 5
WY 4
ME 3
NM 3
RI 3
SD 3
WV 3
DE 2
NE 2
MS 1
UT 1

Pennsylvania had the most challenges (148), and Utah and Mississippi the fewest (1 each).

d)¶

In [17]:
book_banning %>% 
    group_by( state ) %>% 
    summarize( removal_rate = mean( removed ), count = n() ) %>% 
    arrange( desc(removal_rate) )
A tibble: 47 × 3
state removal_rate count
<chr> <dbl>        <int>
MS    1.00000000     1
ID    0.85714286    14
KS    0.76923077    13
MO    0.72727273    11
NM    0.66666667     3
VA    0.52941176    34
AL    0.50000000    10
DE    0.50000000     2
NY    0.44000000    25
AZ    0.41666667    12
LA    0.41666667    12
SC    0.40000000    15
TN    0.38461538    13
MA    0.37500000     8
OK    0.34782609    23
OH    0.34482759    29
ND    0.33333333     6
RI    0.33333333     3
SD    0.33333333     3
WA    0.33333333     9
CA    0.32000000    50
FL    0.30769231    26
GA    0.30769231    13
IL    0.30769231    39
IN    0.30000000    20
WI    0.30000000    10
NH    0.28571429     7
IA    0.25000000    16
WY    0.25000000     4
NC    0.20000000    20
MI    0.19444444    36
NJ    0.18181818    11
CT    0.16666667     6
KY    0.15384615    13
CO    0.12345679    81
AK    0.11111111     9
PA    0.07432432   148
MN    0.05555556    18
OR    0.04237288   118
AR    0.00000000     5
MD    0.00000000     5
ME    0.00000000     3
MT    0.00000000     8
NE    0.00000000     2
UT    0.00000000     1
VT    0.00000000    13
WV    0.00000000     3

Mississippi has the highest removal rate at 100%, but only one book was challenged there; Idaho has a removal rate of 86% with 14 challenged books, which is probably more meaningful. Seven states have a removal rate of 0%, Vermont being the most notable with 13 challenged books, none of which were removed.

e)¶

Three different plots (they could probably be combined into one):

In [18]:
book_banning %>% 
    group_by( removed, language ) %>% 
    summarize( count=n() ) %>% 
    mutate( proportion=count/sum(count) )  %>% 
    mutate( removed=as.factor(removed), language=as.factor(language) )  %>% 
    ggplot( aes(x=removed, y=proportion, fill=language) ) +
    geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the
`.groups` argument.
In [19]:
book_banning %>% 
    group_by( removed, antifamily ) %>% 
    summarize( count=n() ) %>% 
    mutate( proportion=count/sum(count) )  %>% 
    mutate( removed=as.factor(removed), antifamily=as.factor(antifamily) )  %>% 
    ggplot( aes(x=removed, y=proportion, fill=antifamily) ) +
    geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the
`.groups` argument.
In [20]:
book_banning %>% 
    group_by( removed, violent ) %>% 
    summarize( count=n() ) %>% 
    mutate( proportion=count/sum(count) )  %>% 
    mutate( removed=as.factor(removed), violent=as.factor(violent) )  %>% 
    ggplot( aes(x=removed, y=proportion, fill=violent) ) +
    geom_bar( stat="identity" )
`summarise()` has grouped output by 'removed'. You can override using the
`.groups` argument.

All predictors carry only weak information, so our model will probably not explain much variance (grouping by state might help, however).

Exercise 18.3¶

a)¶

In [21]:
book_model_1 <- stan_glmer(
  removed ~ language + antifamily + violent + (1 | state), 
  data = book_banning, family = binomial,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

MCMC diagnostics:

In [22]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(book_model_1, size = 0.1)
mcmc_dens_overlay(book_model_1)
mcmc_acf(book_model_1)
neff_ratio(book_model_1)
rhat(book_model_1)
Warning message:
“The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the bayesplot package.
  Please report the issue at <https://github.com/stan-dev/bayesplot/issues/>.”
neff_ratio: (Intercept) 0.33, language 1.38, antifamily 1.43, violent 1.34, Sigma[state:(Intercept),(Intercept)] 0.35; the per-state intercepts b[(Intercept) state:*] range from 0.64 (CA) to 1.50 (DE).

rhat: all values lie between 0.9998 and 1.0003.

Looks good! (I find it difficult to get the autocorrelation plots to a readable scale.)

b)¶

Posterior median model at 80% level:

In [23]:
tidy(book_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
A tibble: 4 × 5
term        estimate    std.error  conf.low     conf.high
<chr>       <dbl>       <dbl>      <dbl>        <dbl>
(Intercept) -1.1912906  0.1941925  -1.44911049  -0.9411182
language     0.3318395  0.1903472   0.08802237   0.5753550
antifamily  -1.0319176  0.4835658  -1.68526989  -0.4450214
violent      0.4734280  0.2367321   0.17149035   0.7712990
$$\log\left(\frac{p(X_1, X_2, X_3)}{1-p(X_1, X_2, X_3)}\right) = -1.20 + 0.33 X_1 - 1.04 X_2 + 0.48 X_3$$

or in other words

$$p(X_1, X_2, X_3) = \frac{1}{1+\exp(-(-1.20 + 0.33 X_1 - 1.04 X_2 + 0.48 X_3))}$$
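To make the formula concrete, the posterior-median removal probabilities for all eight predictor combinations can be computed with plogis, the inverse logit (using the rounded median coefficients above):

```r
# Removal probability for each combination of the three reason indicators,
# based on the rounded posterior medians.
grid <- expand.grid(language = 0:1, antifamily = 0:1, violent = 0:1)
grid$prob <- plogis(-1.20 + 0.33 * grid$language
                    - 1.04 * grid$antifamily + 0.48 * grid$violent)
grid
```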
In [24]:
c( exp(0.33), exp(-1.04), exp(0.48) )
  1. 1.39096812846378
  2. 0.35345468195878
  3. 1.61607440219289

If a book was challenged for inappropriate language, the odds of removal increase by almost 40%. If it was challenged for anti-family material, the odds of removal decrease by almost 65%. Finally, if it was challenged for violent content, the odds of removal increase by about 62%. In the global model, anti-family content thus appears to make removal less likely.

c)¶

All effects appear to be significant at the 80% level, since none of the 80% credible intervals contain zero.

d)¶

In [25]:
exp(0.33) * exp(0.48)
2.24790798667647

For a book challenged for both inappropriate language and violent content (but not anti-family content), the odds of removal increase by a factor of more than 2.

Exercise 18.4¶

a)¶

I pick a threshold of 0.5; however, one should really check all thresholds and argue in terms of the AUC.

In [26]:
set.seed(84735)
csummary <- classification_summary(data = book_banning, model = book_model_1, cutoff = 0.5)
csummary
$confusion_matrix
A tabyl: 2 × 3
y  0    1
0  700  14
1  181  36
$accuracy_rates
A data.frame: 3 × 1
sensitivity       0.1658986
specificity       0.9803922
overall_accuracy  0.7905478

At a threshold of 0.5, the accuracy of the model is almost 80%, but with extremely low sensitivity (which could be improved by lowering the threshold). The accuracy looks impressive, but it is largely due to the class imbalance in the data:

In [27]:
table( book_banning$removed ) / nrow(book_banning)
        0         1 
0.7669173 0.2330827 

Almost 80% of the books are not banned anyway, so a model that always predicts "not removed" would perform very similarly to ours.
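Before turning to the AUC, the cutoff could also be tuned directly. A hedged sketch sweeping a few cutoffs to show the sensitivity/specificity trade-off (each call reruns the posterior classification, so this is slow):

```r
# Sweep a few candidate cutoffs and compare the resulting accuracy rates.
for (cutoff in c(0.2, 0.3, 0.4, 0.5)) {
  cs <- classification_summary(data = book_banning, model = book_model_1,
                               cutoff = cutoff)
  cat("cutoff =", cutoff, "\n")
  print(cs$accuracy_rates)
}
```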

In [28]:
predictions <- as.numeric( colMeans( posterior_predict( book_model_1, book_banning ) ) )
roc_object <- roc( book_banning$removed, predictions )
auc(roc_object)
Setting levels: control = 0, case = 1

Setting direction: controls < cases

0.790045050278176
In [29]:
options(repr.plot.width=10, repr.plot.height=10)
plot(roc_object)

One could dive further into the class imbalance problem here.

b)¶

In [30]:
tidy(book_model_1, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80) %>% 
    filter( level %in% c("KS", "OR") )
A tibble: 2 × 7
level  group  term         estimate   std.error  conf.low   conf.high
<chr>  <chr>  <chr>        <dbl>      <dbl>      <dbl>      <dbl>
KS     state  (Intercept)   1.530648  0.5388864   0.864348   2.253114
OR     state  (Intercept)  -1.735703  0.4055810  -2.290942  -1.242795
In [31]:
c( exp(1.53), exp(-1.73) )
  1. 4.61817682229978
  2. 0.177284409969878

These are offsets from the global intercept: relative to the global baseline, the odds of a challenged book being removed in Kansas are about 4.6 times higher, while in Oregon they are lower by a factor of 0.18 ($\approx$ 5.6 times). This makes sense when looking at the data:

In [32]:
book_banning %>% 
    filter( state  %in%  c("KS", "OR") ) %>%
    group_by( state ) %>% 
    summarize( removal_rate = mean( removed ) )
A tibble: 2 × 2
state  removal_rate
<chr>  <dbl>
KS     0.76923077
OR     0.04237288

c)¶

In [33]:
predictions <- posterior_predict( book_model_1, newdata=data.frame( 
    state=c("KS", "OR"), 
    language=c(1,1), 
    antifamily=c(0,0), 
    violent=c(0,0)) 
)

colMeans( predictions )
1  0.65585
2  0.07085

66% of the posterior predictions have the book removed in Kansas, and 7.1% in Oregon.

Exercise 18.5¶

In [34]:
colnames( basketball )
  1. 'player_name'
  2. 'height'
  3. 'weight'
  4. 'year'
  5. 'team'
  6. 'age'
  7. 'games_played'
  8. 'games_started'
  9. 'avg_minutes_played'
  10. 'avg_field_goals'
  11. 'avg_field_goal_attempts'
  12. 'field_goal_pct'
  13. 'avg_three_pointers'
  14. 'avg_three_pointer_attempts'
  15. 'three_pointer_pct'
  16. 'avg_two_pointers'
  17. 'avg_two_pointer_attempts'
  18. 'two_pointer_pct'
  19. 'avg_free_throws'
  20. 'avg_free_throw_attempts'
  21. 'free_throw_pct'
  22. 'avg_offensive_rb'
  23. 'avg_defensive_rb'
  24. 'avg_rb'
  25. 'avg_assists'
  26. 'avg_steals'
  27. 'avg_blocks'
  28. 'avg_turnovers'
  29. 'avg_personal_fouls'
  30. 'avg_points'
  31. 'total_minutes'
  32. 'starter'
In [35]:
dim( basketball )
  1. 146
  2. 32
In [36]:
basketball %>% count( year )
A tibble: 1 × 2
year   n
<int>  <int>
2019   146

Each row represents the statistics of a player in 2019.

a)¶

Some player names appear in multiple teams:

In [37]:
basketball %>% count( player_name, team ) %>% count( player_name ) %>% filter( n>1 )
A tibble: 6 × 2
player_name n
<chr> <int>
Alaina Coates 3
Asia Taylor 3
Bridget Carleton 3
Karlie Samuelson 3
Kristine Anigwe 3
Theresa Plaisance3

It appears these are players who changed teams within the year. It is not entirely clear how to treat these exceptions, but they are few in number, so we neglect them.

In [38]:
basketball %>% summarize( 
    n_player_statistics=n(), 
    n_unique_players=n_distinct(player_name), 
    n_teams=n_distinct(team) 
)
A tibble: 1 × 3
n_player_statistics  n_unique_players  n_teams
146                  134               13

b)¶

13, see a)

c)¶

In [39]:
options(repr.plot.width=15, repr.plot.height=5)
ggplot( basketball, aes(x=avg_points, y=total_minutes, color=starter)) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'

Players who score more points per game are more often starters, and they also play more minutes over the season. This makes sense: better players are probably kept in the game longer. The relationship between total minutes and average points scored is more or less linear, with very similar slopes for starters and non-starters.

d)¶

In [40]:
ggplot( basketball, aes(x=games_played, y=total_minutes, color=starter)) + geom_point() + geom_smooth( method="lm" )
`geom_smooth()` using formula = 'y ~ x'

The more games players appeared in, the more total minutes they accumulated, and starters have more total minutes for the same number of games. The relationship between total minutes and games played is roughly linear. The fitted model for starters (blue) does not look very trustworthy at the lower end; assuming the same slope for starters and non-starters is probably reasonable.

Exercise 18.6¶

a)¶

$$Y_{ij} | \beta_{0j}, \beta_1, \beta_2, \beta_3 \sim \text{Pois}(\lambda_{ij}), \quad \text{with } \log\left(\lambda_{ij}\right) = \beta_{0j} + \beta_1 X_{ij1} + \beta_2 X_{ij2} + \beta_3 X_{ij3}$$
$$\beta_{0j} | \beta_0, \sigma_0 \sim N(\beta_0, \sigma_0^2)$$
$$\beta_{0c} \sim N(m_0, s_0^2)$$
$$\beta_1 \sim N(m_1, s_1^2)$$
$$\beta_2 \sim N(m_2, s_2^2)$$
$$\beta_3 \sim N(m_3, s_3^2)$$
$$\sigma_0 \sim \text{Exp}(l_0)$$

Here $Y_{ij}$ indicates the total minutes player $i$ of team $j$ played, $X_{ij1}$ stands for avg_points, $X_{ij2}$ is an indicator variable for starter and $X_{ij3}$ stands for games_played.

b)¶

total_minutes is a nonnegative count; of all the distributions we have seen in this book, only the Poisson and the negative binomial are designed for such outcomes.

In [41]:
ggplot( basketball, aes(x=total_minutes, fill=team) ) + geom_density( alpha=0.2 )

It is hard to see a Poisson distribution here. Only a posterior predictive check will tell us whether a Poisson regression fits or whether we should use a negative binomial regression instead (appropriate when the variance exceeds the expectation), these being the only models in this book for nonnegative counts.

c)¶

The distribution of total_minutes varies quite a bit per team:

In [42]:
ggplot( basketball, aes(x=team, y=total_minutes) ) + geom_boxplot()

The number of players per team is not very large for a hierarchical regression model with group-specific intercepts and three different slopes:

In [43]:
ggplot( basketball, aes(x=team) ) + geom_bar()

d)¶

In [44]:
set.seed(84735)
basketball_model_poisson <- stan_glmer(
  total_minutes ~ games_played + starter + avg_points + (1 | team), 
  data = basketball, family = poisson,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

Diagnostics:

In [45]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_poisson, size = 0.1)
mcmc_dens_overlay(basketball_model_poisson)
mcmc_acf(basketball_model_poisson)
neff_ratio(basketball_model_poisson)
rhat(basketball_model_poisson)
neff_ratio: (Intercept) 0.23, games_played 0.74, starterTRUE 0.80, avg_points 0.69, Sigma[team:(Intercept),(Intercept)] 0.18; the per-team intercepts b[(Intercept) team:*] range from 0.20 (DAL) to 0.33 (TOT).

rhat: all values lie between 1.0001 and 1.0019.

Looks quite ok at first sight!

Posterior predictive check:

In [46]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(basketball_model_poisson) + xlab("total minutes")

The hierarchical Poisson model gets quite close to the data, but interestingly it is much too confident! Might this be due to the assumption that the variance equals the expectation?
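A crude check of this suspicion on the marginal distribution (the Poisson assumption concerns the conditional variance given the predictors, so this is only suggestive):

```r
# Compare the raw mean and variance of the outcome; for a Poisson
# outcome these should be of similar magnitude.
basketball %>%
    summarize( mean_minutes = mean(total_minutes),
               var_minutes = var(total_minutes) )
```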

Exercise 18.7¶

Simulate models¶

Negative binomial model:

In [47]:
set.seed(84735)
basketball_model_negbinom <- stan_glmer(
  total_minutes ~ games_played + starter + avg_points + (1 | team), 
  data = basketball, family = neg_binomial_2,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

Diagnostics:

In [48]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_negbinom, size = 0.1)
mcmc_dens_overlay(basketball_model_negbinom)
mcmc_acf(basketball_model_negbinom)
neff_ratio(basketball_model_negbinom)
rhat(basketball_model_negbinom)
neff_ratio: (Intercept) 0.85, games_played 0.93, starterTRUE 0.90, avg_points 0.81, reciprocal_dispersion 0.88, Sigma[team:(Intercept),(Intercept)] 0.31; the per-team intercepts b[(Intercept) team:*] range from 0.54 (PHO) to 1.08 (IND).

rhat: all values lie between 0.9999 and 1.0006.

Looks reasonably converged.

Normal model:

In [49]:
set.seed(84735)
basketball_model_normal <- stan_glmer(
  total_minutes ~ games_played + starter + avg_points + (1 | team), 
  data = basketball, family = gaussian,
  prior_intercept = normal(0, 2.5, autoscale = TRUE),
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_covariance = decov(reg = 1, conc = 1, shape = 1, scale = 1),
  chains = 4, iter = 5000*2, seed = 84735
)

Diagnostics:

In [50]:
options(repr.plot.width=15, repr.plot.height=15)
mcmc_trace(basketball_model_normal, size = 0.1)
mcmc_dens_overlay(basketball_model_normal)
mcmc_acf(basketball_model_normal)
neff_ratio(basketball_model_normal)
rhat(basketball_model_normal)
neff_ratio: (Intercept) 1.23, games_played 1.00, starterTRUE 1.05, avg_points 0.78, sigma 1.43, Sigma[team:(Intercept),(Intercept)] 0.68; the per-team intercepts b[(Intercept) team:*] range from 0.94 (NYL) to 1.27 (LAS).

rhat: all values lie between 0.9998 and 1.0003.

Looks reasonably converged.

Posterior predictive checks¶

In [51]:
options(repr.plot.width=15, repr.plot.height=5)
pp_check(basketball_model_negbinom) + xlab("total minutes") + xlim(0, 1650)
Warning message:
“Removed 138 rows containing non-finite values (`stat_density()`).”

The negative binomial model is more uncertain than the Poisson model and better captures the data.

In [52]:
pp_check(basketball_model_normal) + xlab("total minutes")

The normal model predicts a substantial number of negative total minutes. For this reason I would not choose it to model this data, even if it performed best.

Model evaluation¶

In [53]:
set.seed(84735)
prediction_summary(model = basketball_model_poisson, data = basketball)
A data.frame: 1 × 4
mae       mae_scaled  within_50  within_95
<dbl>     <dbl>       <dbl>      <dbl>
38.57865  2.548435    0.1575342  0.3767123
In [54]:
set.seed(84735)
prediction_summary(model = basketball_model_negbinom, data = basketball)
A data.frame: 1 × 4
       mae  mae_scaled  within_50  within_95
     <dbl>       <dbl>      <dbl>      <dbl>
  37.07328   0.4544823  0.6917808  0.9520548
In [55]:
set.seed(84735)
prediction_summary(model = basketball_model_normal, data = basketball)
A data.frame: 1 × 4
       mae  mae_scaled  within_50  within_95
     <dbl>       <dbl>      <dbl>      <dbl>
   62.4532   0.7035239  0.4726027  0.9726027

The Poisson model has a median absolute error similar to the negative binomial model's, but its scaled median absolute error is far larger, and it captures only 16% of the observations within its 50% posterior predictive intervals and only 38% within its 95% intervals. It should not be used, as it substantially understates uncertainty. The negative binomial model performs much better, with reasonable values for mae_scaled, within_50 and within_95. Interestingly, the normal model has a markedly worse median absolute error, yet it beats the Poisson model on the other measures.
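The within_50 and within_95 columns can be reproduced by hand from posterior predictive quantiles. A sketch, assuming the negative binomial fit and the `basketball` data are in scope and that the response column is `total_minutes` as modeled above:

```r
# One predictive column per observation
pp <- posterior_predict(basketball_model_negbinom)

# 50% posterior predictive interval for each observation
lower <- apply(pp, 2, quantile, probs = 0.25)
upper <- apply(pp, 2, quantile, probs = 0.75)

# Proportion of observed totals falling inside their 50% intervals
mean(basketball$total_minutes >= lower & basketball$total_minutes <= upper)
```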

Differences in terms of ELPD:

In [56]:
set.seed(84735)
elpd1 <- loo(basketball_model_poisson, k_threshold=0.7)
elpd2 <- loo(basketball_model_negbinom, k_threshold=0.7)
elpd3 <- loo(basketball_model_normal, k_threshold=0.7)
loo_compare(elpd1, elpd2, elpd3)
18 problematic observation(s) found.
Model will be refit 18 times.

Fitting model 1 out of 18 (leaving out observation 13)
[... 17 further refits, leaving out observations 16, 31, 39, 43, 63, 68, 70, 73, 81, 83, 84, 85, 90, 92, 105, 119 and 126 ...]

All pareto_k estimates below user-specified threshold of 0.7. 
Returning loo object.

All pareto_k estimates below user-specified threshold of 0.7. 
Returning loo object.

A compare.loo: 3 × 8 of type dbl
                              elpd_diff   se_diff   elpd_loo  se_elpd_loo      p_loo    se_p_loo    looic  se_looic
basketball_model_normal       0.0000000   0.00000  -863.3841     8.310301   7.005304   0.9653374 1726.768  16.62060
basketball_model_negbinom    -0.5856422  17.37618  -863.9698    16.436069  10.603503   2.2909870 1727.940  32.87214
basketball_model_poisson  -1051.8485090 245.43060 -1915.2326   245.739646 351.972863 144.0119836 3830.465 491.47929

Interestingly, the normal model edges out the negative binomial model by a tiny, non-significant margin in terms of ELPD (the difference is well within its standard error). The Poisson model is much worse, probably because it is overconfident. It would be interesting to investigate this more deeply. Let us choose the negative binomial model for further examination.
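Since the normal and negative binomial models are essentially tied in ELPD, another way to weigh them is via stacking weights from the loo package. A sketch reusing the loo objects computed above:

```r
# Stacking weights: how much each model would contribute to an
# optimal predictive ensemble (the weights sum to 1)
loo_model_weights(list(poisson  = elpd1,
                       negbinom = elpd2,
                       normal   = elpd3))
```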

Exercise 18.8¶

a)¶

As discussed above, choose the negative binomial model.

Global model:

In [57]:
tidy(basketball_model_negbinom, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
A tibble: 4 × 5
  term           estimate   std.error    conf.low  conf.high
  <chr>             <dbl>       <dbl>       <dbl>      <dbl>
  (Intercept)  3.22590511 0.080575735  3.12014170  3.3272140
  games_played 0.07784400 0.003517451  0.07338436  0.0822994
  starterTRUE  0.17852032 0.077207482  0.08037786  0.2773162
  avg_points   0.09689437 0.011999255  0.08144796  0.1126978

All fixed effects appear to be significant at the 80% level.

$$Y_i \sim \text{NegBin}(\mu_i, r), \quad\text{with } \mu_i = \exp\left(3.23 + 0.078 X_1 + 0.18 X_2 + 0.097 X_3\right),$$

where $X_1$ denotes the number of games played, $X_2$ is an indicator for whether the team member was used as a starter and $X_3$ denotes the average number of points the player scored.

In [58]:
c(exp(3.12), exp(3.32), exp(0.073), exp(0.082), exp(0.080), exp(0.277), exp(0.081), exp(0.113))
  1. 22.6463796431754
  2. 27.6603505585167
  3. 1.0757305369147
  4. 1.08545580982955
  5. 1.08328706767496
  6. 1.31916637103496
  7. 1.08437089656676
  8. 1.11963193294859

The baseline is 23-28 total minutes played in the season. This baseline increases by about 8% for each game the player played, by 8-32% if the player was used as a starter, and by 8-12% for each point in the player's scoring average.
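These multiplicative effects combine in posterior predictions. A sketch for a hypothetical new player, assuming the column names used in the model formula above and that `team` takes the abbreviations seen in the random-effects output:

```r
# Hypothetical player: 30 games, used as a starter,
# averaging 15 points, playing for SEA
new_player <- data.frame(games_played = 30, starter = TRUE,
                         avg_points = 15, team = "SEA")

# Posterior predictive draws of total minutes for this player,
# summarized by the median and a 95% predictive interval
pp_new <- posterior_predict(basketball_model_negbinom, newdata = new_player)
quantile(pp_new, probs = c(0.025, 0.5, 0.975))
```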

Intercepts:

In [59]:
tidy(basketball_model_negbinom, effects = "ran_vals", conf.int = TRUE, conf.level = 0.80) %>% 
    arrange( desc(estimate) ) %>% 
    mutate( baseline_difference=exp(estimate) )
A tibble: 13 × 8
  level group term           estimate  std.error     conf.low    conf.high  baseline_difference
  <chr> <chr> <chr>             <dbl>      <dbl>        <dbl>        <dbl>                <dbl>
  NYL   team  (Intercept)  0.07262616 0.08199996 -0.007239109  0.190994750            1.0753285
  PHO   team  (Intercept)  0.06855005 0.07984687 -0.008274054  0.187975585            1.0709542
  DAL   team  (Intercept)  0.04974790 0.07031646 -0.019710644  0.157827379            1.0510061
  LAS   team  (Intercept)  0.04783049 0.07044620 -0.023998892  0.158986958            1.0489928
  TOT   team  (Intercept)  0.01695884 0.06664789 -0.064692201  0.131703285            1.0171035
  MIN   team  (Intercept)  0.00037155 0.05849000 -0.081561008  0.086171826            1.0003716
  IND   team  (Intercept) -0.00490494 0.05842614 -0.096176292  0.077544670            0.9951071
  SEA   team  (Intercept) -0.01718176 0.06232905 -0.118300787  0.060309130            0.9829650
  CON   team  (Intercept) -0.02133088 0.05982796 -0.118854380  0.051066556            0.9788950
  ATL   team  (Intercept) -0.02435056 0.06104048 -0.121613248  0.049141909            0.9759435
  CHI   team  (Intercept) -0.04590614 0.07046683 -0.158924103  0.028434013            0.9551316
  WAS   team  (Intercept) -0.05340487 0.07664904 -0.171591811  0.022742621            0.9479961
  LVA   team  (Intercept) -0.07668804 0.09071436 -0.211633056  0.008359801            0.9261787

The NYL players have the largest total playing time on average, about 7.5% above the global baseline, and the LVA players have the shortest total playing time on average, about 7% below the global baseline.

Open-ended Exercises 18.9 / 18.10¶

... are left to the reader; the concepts should be clear by now.